Preprint ANL/MCS-P875 0401, April, 2001 

Mathematics and Computer Science Division 
Argonne National Laboratory 



Jeff Linderoth • Stephen Wright 

Decomposition Algorithms for Stochastic 
Programming on a Computational Grid 

February 1, 2008 

Abstract . We describe algorithms for two-stage stochastic linear programming with recourse 
and their implementation on a grid computing platform. In particular, we examine serial 
and asynchronous versions of the L-shaped method and a trust-region method. The parallel 
platform of choice is the dynamic, heterogeneous, opportunistic platform provided by the 
Condor system. The algorithms are of master-worker type (with the workers being used to 
solve second-stage problems), and the MW runtime support library (which supports master- 
worker computations) is key to the implementation. Computational results are presented on 
large sample average approximations of problems from the literature. 



1. Introduction 

Consider the following stochastic optimization problem: 

N 

mm F(x) = y^PiffaUi), (1) 

i=l 

where S G R n is a constraint set, fl = {wi,W2, . . . , ujn} is the set of outcomes 
(consisting of N distinct scenarios), and Pi is the probability associated with 
each scenario. Problems of the form ([!]) can arise directly (in many applications, 
the number of scenarios is naturally finite), or as discretizations of problems over 
continuous probability spaces, obtained by approximation or sampling. In this 
paper, we discuss the two-stage stochastic linear programming problem with fixed 
resource, which is a special case of (|j) defined as follows: 

mm c T x + J2iLiPt ( l( UJ l ) T y(uj l ), subject to (2a) 
Ax = b, x>0, (2b) 
Wy(wi) = h{ui)-T(ui)x, y(u>i)>0, i = l,2,...,N. (2c) 

The unknowns in this formulation are x and y(u>i),y(u>2), ■ ■ ■ , y(wjv), where x 
contains the "first-stage variables" and each contains the "second-stage 
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variables" associated with the ith scenario. The ith scenario is characterized by 
the probability pi and the data objects (q(u>i), T(u>i), h(uji)). 

The formulation (^|) is sometimes known as the "deterministic equivalent" 
because it lists the unknowns for all scenarios explicitly and poses the problem as 
a (potentially very large) structured linear program. An alternative formulation 
is obtained by recognizing that each term in the second-stage summation in ( pa| ) 
is a piecewise linear convex function of x. Defining the ith second-stage problem 
as a linear program (LP) parametrized by the first-stage variables x, that is, 

Qi(x) d = min y ( Wi) q{uji) T y{uji) subject to (3a) 
Wy{uji) = h{uoi) - T(ui)x, y{oJi) > 0, (3b) 

and defining the objective in ( |2a| ) as 

N 

Q(x) d ^c T x + J2PiQi(^ (4) 

we can restate ^ as 

min Q(x), subject to Ax = b, x > 0. (5) 

X 

We note several features about the problem (||). First, it is clear from (0) 
and (|) that Q(x) can be evaluated for a given x by solving the N linear pro- 
grams (||) separately. Second, we can derive subgradient information for Qi{x) 
by considering dual solutions of (^|). If we fix x = x in (^|), the primal solution 
yiuii) and dual solution n{u)i) satisfy the following optimality conditions: 

q(u t ) - W T ir(uJi) > _L y(wt) > 0, 

Wy(u>i) = h((jJi) - T(uoi)x. 

From these two conditions we obtain that 

Qi(x) = q(u)i) T y(u)i) = n(Ui) T Wy(tAJi) = 7r(w i ) T [/i(w i ) - T{uj t )x\. (6) 

Moreover, since Qi is piecewise linear and convex, we have for any x that 

Qi(x) - Qi(x) > w(oJi) T [-T(uJi)x + T{u t )x] = (-T(^) T 7r( Wl )) T (x - x), (7) 

which implies that 

-T(Wi) T TT{LJi) e dQi(x), (8) 

where dQi(x) denotes the subgradient of Qi at x. By Rockafellar Theo- 
rem 23.8], using polyhedrality of each Qi, we have from (|J) that 

N 

dQ(x)=c + Y / PidQi{£), (9) 
i=i 

for every x that lies in the domain of each Qj, i = 1, 2, . . . , N. 
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Let <S denote the solution set for (||) ; we assume for most of the paper that 
S is nonempty. Since ^ is a convex program, S is closed and convex, and the 
projection operator P(-) onto S is well defined. Because the objective function 
in (JsJ) is piecewise linear and the constraints are linear, the problem has a weak 
sharp minimum (Burke and Ferris Q); that is, there exists e > such that 

Q(x)-Q*>e\\x-P(x)\\ 00 , for all x with Ax = b, x > 0, (10) 

where Q* is the optimal value of the objective. 

The subgradient information can be used by algorithms in different ways. 
Successive estimates of the optimal x can be obtained by minimizing over a 
convex underestimate of Q(x) constructed from subgradients obtained at earlier 
iterations, as in the L-shaped method described in Section ||. This method can 
be stabilized by the use of a quadratic regularization term (Ruszczyhski pjfl , 
Kiwiel [[l6|) or by the explicit use of a trust region, as in the 1^ trust-region 
approach described in Section [| Alternatively, when an upper bound on the op- 
timal value Q* is available, one can derive each new iterate from an approximate 
analytic center of an approximate epigraph. The latter approach has been ex- 
plored by Bahn et al. JIJ and applied to a large stochastic programming problem 
by Frangiere, Gondzio, and Vial ||. 

Because evaluation of Qi(x) and elements of its subdifferential can be carried 
out independently for each i = 1,2, ... ,N, and because such evaluations usually 
constitute the bulk of the computational workload, implementation on parallel 
computers is possible. We can partition second-stage scenarios i = 1,2, ... ,N 
into "chunks" and define a computational task to be the solution of all the LPs 
in a single chunk. Each such task could be assigned to an available worker 
processor. Relationships between the solutions of (|^) for different scenarios can 
be exploited within each chunk (see Birge and Louveaux J5|, Section 5.4]). The 
number of second-stage LPs in each chunk should be chosen to ensure that the 
computation does not become communication bound. That is, each chunk should 
be large enough that its processing time significantly exceeds the time required 
to send the data to the worker processor and to return the results. 

In this paper, we describe implementations of decomposition algorithms for 
stochastic programming on a dynamic, heterogeneous computational grid made 
up of workstations, PCs (from clusters), and supercomputer nodes. Specifically, 
we use the environment provided by the Condor system ]T^| . We also discuss 
the MW runtime library (Goux et al. [ fL3p^ ] ) , a software layer that significantly 
simplifies the process of implementing parallel algorithms in Condor. 

For the dimensions of problems and parallel platforms considered in this pa- 
per, evaluation of the functions Qi(x) and their subgradients at a single x often 
is insufficient to make effective use of the available processors. Moreover, "syn- 
chronous" algorithms — those that depend for efficiency on all tasks completing 
in a timely fashion — run the risk of poor performance in an environment such as 
ours, in which failure or suspension of worker processors while they are process- 
ing a task is not an infrequent event. We are led therefore to "asynchronous" 
approaches that consider different points x simultaneously. Asynchronous vari- 
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ants of the L-shaped and l x trust-region methods are described in Sections 2.2 
and respectively. 

Other parallel algorithms for stochastic programming have been devised by 
Birge et al. Birge and Qi ||, and Frangiere, Gondzio, and Vial §. In @, the 
focus is on multistage problems in which the scenario tree is decomposed into 
subtrees, which are processed independently and in parallel on worker processors. 
Dual solutions from each subtree are used to construct a model of the first- 
stage objective (using an L-shaped approach like that described in Section ^), 
which is periodically solved by a master process to obtain a new candidate 
first-stage solution x. Parallelization of the linear algebra operations in interior- 
point algorithms is considered in ||, but this approach involves significant data 
movement and does not scale particularly well. In ^] , the second-stage problems 
(||) are solved concurrently and inexactly by using an interior-point code. The 
master process maintains an upper bound on the optimal objective, and this 
bound along with the subgradients obtained from the second-stage problems 
yields a polygon whose (approximate) analytic center is calculated periodically 
to obtain a new candidate x. The approach is based in part on an algorithm 
described by Gondzio and Vial Jll| . The numerical results in 13] report solution 
of a two-stage stochastic linear program with 2.6 million variables and 1.2 million 
constraints in three hours on a cluster of 10 Linux PCs. 



2. L-Shaped Methods 

We now describe the L-shaped method, a fundamental algorithm for solving 
and an asynchronous variant. 



2.1. The Multicut L-Shaped Method 

The L-shaped method of Van Slyke and Wets |25| for solving (|J) proceeds by 
finding subgradients of partial sums of the terms that make up Q (IJ), together 
with linear inequalities that define the domain of Q. The method is essentially 
Benders decomposition enhanced to deal with infeasible iterates. A full de- 
scription is given in Chapter 5 of Birge and Louveaux |5) . We sketch the approach 
here and show how it can be implemented in an asynchronous fashion. 

We suppose that the second-stage scenarios indexed by 1, 2, . . . , N are parti- 
tioned into T clusters denoted by Ai, A/2, . . . , At- Let Qy] represent the partial 
sum from ^ corresponding to the cluster A/j : 

fibiO=)= X>fii(*)- ( n ) 

The algorithm maintains a model function m£i , which is a piecewise linear lower 
bound on Qy-i for each j. We define this function at iteration k by 

m^x)=M{9 j \9 j e>F^x + f^}, (12) 
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where FK is a matrix whose rows are subgradients of at previous iterates of 
the algorithm, and e = (1, 1, . . . , 1) T . The rows of 9je > FKx + fK are referred 
to as optimality cuts. Upon evaluating Q^-j at the new iterate x k by solving (^|) 
for each i e Afj, a subgradient Qj G dQ^ can be obtained from a formula derived 
from (||) and (Q), namely, 

9j = - ^2 p l T(u) l ) T Tt{uj l ), (13) 

where each n(u)i) is an optimal dual solution of (||). Since by the subgradient 
property we have 

Q m {x)>gjx+{Q m {x k )-gJx k ), 

we can obtain F^f 1 from Fh, by appending the row gj ', and /uj from fK by 
appending the element (Q[j](x k ) — gjx k ). In order to keep the number of cuts 
reasonable, the cut is not added if is not greater than the value predicted 
by the lower bounding approximation (see (|l7|) below). In this case, the current 
set of cuts in FK, fh adequately models Qyj. In addition, we may also wish 

to delete some rows from F^t 1 , fut 1 corresponding to facets of the epigraph of 
( |l2| ) that we do not expect to be active in later iterations. 

The algorithm also maintains a collection of feasibility cuts of the form 

D k x > d k , (14) 

which have the effect of excluding values of x that were found to be infeasible, 
in the sense that some of the second-stage linear programs (0) are infeasible 
for these values of x. By Farkas's theorem (see Mangasarian pi p. 31]), if the 
constraints ( |3b| ) are infeasible, there exists n(uji) with the following properties: 

W T Tr(u>i) < 0, [h(u>i) - T^xf 7r( Wl ) > 0. 

(In fact, such a 7r(cjj) can be obtained from the dual simplex method for the 



feasibility problem (3_b).) To exclude this x from further consideration, we simply 
add the inequality [h(u>i)— T(oji)x] T ir(oJi) < to the constraint set, by appending 
the row vector n(uji) T T{uji) to D k and the element ir(u>i) T h(uJi) to d k in (|l4|). 

The iterate x k of the multicut L-shaped method is obtained by solving the 
following approximation to (||) : 

min m k (x), subject to D k x > d k , Ax = b, x > 0, (15) 

X 

where 

T 

m k (x) = c T x + 2J m [j]( x )- (16) 
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In practice, we substitute from (|12|) to obtain the following linear program: 

T 

min c T cc + > 6a, subject to (17a) 
x ,e u ...,e T ^— ' 
i=i 

^e^F^ + Zg], j = l,2,...,T, (17b) 
D fc a; > d fc , (17c) 
Ax = b, x > 0. (17d) 



The L-shaped method proceeds by solving (|T7]) to generate a new candidate x, 
then evaluating the partial sums ( |ll|) and adding optimality and feasibility cuts 
as described above. The process is repeated, terminating when the improvement 
in objective promised by the subproblem ( |i~5| ) becomes small. 

For simplicity we make the following assumption for the remainder of the 
paper. 



Assumption 1. 

(i) The problem has complete recourse; that is, the feasible set of (^) is nonempty 

for all i — 1,2, ... ,N and all x, so that the domain of Q(x) in (Qj is R". 
(ii) The solution set S is nonempty. 



Under this assumption, feasibility cuts of the form (|14|), (17c) do not appear 
during the course of the algorithm. Our algorithms and their analysis can be 
generalized to handle situations in which Assumption ^ does not hold, but since 
our development is complex enough already, we postpone discussion of these 
generalizations to a future report. 

Using Assumption |l|, we can specify the L-shaped algorithm formally as fol- 
lows: 



Algorithm LS 

choose tolerance etolj 
choose starting point a; ; 

define initial model mo to be a piecewise linear underestimate of Q(x) 
such that rriQ^x ) = Q(x°) and ttiq is bounded below; 

Qmin «- Q(X°); 

for k = 0,1,2,... 

obtain x k+1 by solving jl5|); 
if Qmi„ ~ m k (x k+1 ) < e tol (l + |Q min |) 
STOP; 

evaluate function and subgradient information at x k+1 ; 
Q min ^min(Q min ,Q( 2 ; fc + 1 )); 
obtain m fc+1 by adding optimality cuts to m^; 
end(for). 
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2.2. An Asynchronous Parallel Variant of the L-Shaped Method 

The L-shaped approach lends itself naturally to implementation in a master- 
worker framework. The problem (^) is solved by the master process, while 
solution of each cluster Afj of second-stage problems, and generation of the as- 
sociated cuts, can be carried out by the worker processes running in parallel. 
This approach can be adapted for an asynchronous, unreliable environment in 
which the results from some second-stage clusters are not returned in a timely 
fashion. Rather than having all the worker processors sit idle while waiting for 
the tardy results, we can proceed without them, re-solving the master by using 
the additional cuts that were generated by the other second-stage clusters. 

We denote the model function simply by m for the asynchronous algorithm, 
rather than appending a subscript. Whenever the time comes to generate a new 
iterate, the current model is used. In practice, we would expect the algorithm 
to give different results each time it is executed, because of the unpredictable 
speed and order in which the functions are evaluated and subgradients generated. 
Because of Assumption [l], we can write the subproblem 

min m{x), subject to Ax = b, x > 0. (18) 

X 

Algorithm ALS, the asynchronous variant of the L-shaped method that we 
describe here, is made up of four key operations, three of which execute on the 
master processor and one of which runs on the workers. These operations are as 
follows: 

— partial_evaluate. This is the routine for evaluating Qu\{x) defined by ( |ll|) 
for a given x and j, in the process generating a subgradient gj of QuAx). It 
runs on a worker processor and returns its results to the master by activating 
the routine act_on_completed_task on the master processor. 

— evaluate. This routine, which runs on the master, simply places T tasks of 
the type partial_evaluate for a given x into the task pool for distribution 
to the worker processors as they become available. The completion of these 
T tasks is equivalent to evaluating Q(x). 

— initialize. This routine runs on the master processor and performs initial 
bookkeeping, culminating in a call to evaluate for the initial point x . 

— act_on_completed_task. This routine, which runs on the master, is activated 
whenever the results become available from a partial_evaluate task. It 
updates the model and increments a counter to keep track of the number of 
clusters that have been evaluated at each candidate point. When appropriate, 
it solves the master problem with the latest model to obtain a new candidate 
iterate and will call evaluate. 

In our implementation of both this algorithm and its more sophisticated 
cousin Algorithm ATR of Section [|, we may define a single task to consist of the 
evaluation of more than one cluster Afj. We may bundle, say, 5 or 10 clusters 
into a single task, in the interests of making the task large enough to justify the 
master's effort in packing its data and unpacking its results, and to maintain 
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the ratio of compute time to communication cost at a high level. For purposes 
of simplicity, however, we assume in the descriptions both of this algorithm and 
of ATR that each task consists of a single cluster. 

The implementation depends on a "synchronicity" parameter a which is the 
proportion of clusters that must be evaluated at a point to trigger the generation 
of a new candidate iterate. Typical values of a are in the range 0.25 to 0.9. 
A logical variable speceval^, keeps track of whether x k has yet triggered a 
new candidate. Initially, speceval fc is set to false, then set to true when the 
proportion of evaluated clusters passes the threshold a. 

We now specify all the methods making up Algorithm ALS. 

ALS: partial_evaluate(a; 9 , q, j, Qu][x q ), gj) 

Given x q , index q, and partition number j, evaluate Qlj'Ka; 9 ) from ( |ll| ) 

together with a partial subgradient gj from (|l3|); 
Activate act_on_completed_task(x 9 , q, j, Q\j](x q ), gj) on the master processor. 



ALS: evaluate(:c 9 , q) 

for j = 1,2, ... ,T (possibly concurrently) 

partial_evaluate(x 9 , q, j, Qui (x q ),gj); 
end (for) 



ALS: initialize 

choose tolerance etoi; 
choose starting point x°; 
choose threshold a € (0, 1]; 
Qmin <— oo; 

k <— 0, speceval «— false, to <— 0; 
evaluate (a; , 0). 



ALS: act_on_completed_task(a; 9 , q, j, Qy] (x q ), gj) 

tq < tq ~f~ 1, 

add Q[j](x q ) and cut gj to the model m; 
if t q = T 

Qmin <- min(Q min , Q{x q )); 
else if t q > crT and not speceval^ 

speceval^ <— true; 

k <- k + 1; 

solve current model problem (|l^) to obtain x k 
if Qmi„ - m{x k+1 ) < e tol (l + |Q min |) 

STOP; 
evaluate(a; fe , k); 
end (if) 
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We present results for Algorithm ALS in Section |6| While the algorithm is 
able to use a large number of worker processors on our opportunistic platform, 
it suffers from the usual drawbacks of the L-shaped method, namely, that cuts, 
once generated, must be retained for the remainder of the computation to ensure 
convergence and that large steps are typically taken on early iterations before a 
sufficiently good model approximation to Q(x) is created, making it impossible 
to exploit prior knowledge about the location of the solution. 



3. A Bundle- Trust-Region Method 

Trust-region approaches can be implemented by making only minor modifica- 
tions to implementations of the L-shaped method, and they possesses several 
practical advantages along with stronger convergence properties. The trust- 
region methods we describe here are related to the regularized decomposition 
method of Ruszczyhski [^TJ and the bundle-trust-region approaches of Kiwiel fl6j 
and Hirart-Urruty and Lemarechal [fl4| Chapter XV]. The main differences are 
that we use box-shaped trust regions yielding linear programming subproblems 
(rather than quadratic programs) and that our methods manipulate the size of 
the trust region directly rather than indirectly via a regularization parameter. 

When requesting a subgradient of Q at some point x, our algorithms do not 
require particular (e.g., extreme) elements of the subdifferential to be supplied. 
Nor do they require the subdifferential dQ(x) to be representable as a convex 
combination of a finite number of vectors. In this respect, our algorithms contrast 
with that of Ruszczyhski [^lj, for instance, which exploits the piecewise- linear 
nature of the objectives Qi in (||). Because of our weaker conditions on the 
subgradient information, we cannot prove a finite termination result of the type 
presented in J2l], Section 3]. However, these conditions potentially allow our 
algorithms to be extended to a more general class of convex nondifferentiable 
functions. We hope to explore these generalizations in future work. 



3.1. A Method Based on £ x Trust Regions 

A key difference between the trust-region approach of this section and the re- 
shaped method of the preceding section is that we impose an norm bound 
on the size of the step. It is implemented by simply adding bound constraints to 
the linear programming subproblem ( |l7| ) as follows: 

- Ae < x - x k < Ae, (19) 

where e = (1, 1, . . . , 1) T , A is the trust-region radius, and x k is the current 
iterate. During the fcth iteration, it may be necessary to solve several problems 
with trust regions of the form ( p~S| ) , with different model functions m and possibly 
different values of A, before a satisfactory new iterate x k+1 is identified. We refer 
to x k and x k+1 as major iterates and the points x k ' e , I — 0, 1, 2, . . . obtained 
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by minimizing the current model function subject to the constraints and trust- 
region bounds of the form ( |l9| ) as minor iterates. Another key difference between 
the trust-region approach and the L-shaped approach is that a minor iterate x k ' e 
is accepted as the new major iterate x k+1 only if it yields a substantial reduction 
in the objective function Q over the previous iterate x k , in a sense to be defined 
below. A further important difference is that one can delete optimality cuts from 
the model functions, between minor and major iterations, without compromising 
the convergence properties of the algorithm. 

To specify the method, we need to augment the notation established in the 
previous section. We define rrik,e(x) to be the model function after I minor 
iterations have been performed at iteration k, and Ak,£ > to be the trust- 
region radius at the same stage. Under Assumption |l|, there are no feasibility 
cuts, so that the problem to be solved to obtain the minor iteration x k ' 1 is as 
follows: 

min mk.e(x) subject to Ax = b, x > 0, \\x — a; ||oo < Ai~ e (20) 
(cf. (|l5|)). By expanding this problem in a similar fashion to (^7|), we obtain 

T 

min c T a; + > 0j, subject to (21a) 
x,e u ...,s T K ' 

3 e>F k fx + f k f, j = l,2,...,T, (21b) 
Ax = 6, x > 0, (21c) 
- A k<i e < x - x k < A hil e. (21d) 

We assume the initial model mk,o a t major iteration k to satisfy the following 
two properties: 

m k:0 (x k ) = Q(x k ), (22a) 
rrikfi is a piecewise linear underestimate of Q. (22b) 

Denoting the solution of the subproblem ( pl| ) by x k,i , we accept this point 
as the new iterate x k+1 if the decrease in the actual objective Q (see (||)) is at 
least some fraction of the decrease predicted by the model function rrik,e- That 
is, for some constant £ € (0, 1/2), the acceptance test is 

Q(z M ) < Q(x k ) - £ (Q(x k ) - m k Ax k ' e )) . (23) 

(A typical value for £ is 10~ 4 .) 

If the test (|2^) fails to hold, we obtain a new model function rrik.e+i by 
adding and possibly deleting cuts from mk,e{x)- This process aims to refine the 
model function, so that it eventually generates a new major iteration, while 
economizing on storage by allowing deletion of subgradients that no longer seem 
helpful. Addition and deletion of cuts are implemented by adding and deleting 
rows from F k ^ e and /r .j , to obtain Fr .j +1 and /r.j +1 , for j = 1, 2, . . . , T. 

Given some parameter 77 G [0,1), we obtain rrik^+i from nik/ by means of 
the following procedure: 
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Procedure Model-Update (k,£) 
for each optimality cut 

possible_delete <— true; 

if the cut was generated at x k 

possible_delete <— false; 
else if the cut is active at the solution of ( pl| ) 

possible_delete <— false; 
else if the cut was generated at an earlier minor iteration 

I = 0,1,..., £- 1 such that 



Q{x k ) - m k Ax < ) > V [Q(xl ~ m kJ {x^)\ (24) 

possible_delete <— false; 
end (if) 

if possible_delete 

possibly delete the cut; 
end (for each) 

add optimality cuts obtained from each of the component functions 
Q[j] at x k ' e . 

In our implementation, we delete the cut if possible_delete is true at the 
final conditional statement and, in addition, the cut has not be en a ctive during 
the last 100 solutions of (^l|) . More details are given in Section x2 . 



Because we retain all cuts active at x during the course of major iteration 



fc, the following extension of (22a) holds: 

m k , e (x k ) = Q(x k ), £ = 0,1,2,.... (25) 



Since we add only subgradient information, the following generalization of (22b) 
also holds uniformly: 

rrik,e is a piecewise linear underestimate of Q, for I = 0, 1, 2, ... . (26) 

We may also decrease the trust-region radius Ak.e between minor iterations 
(that is, choose Ak,t+i < Ak,t) when the test ( p3| ) fails to hold. We do so if the 
match between model and objective appears to be particularly poor. If Q(x k ' ) 
exceeds Q(x k ) by more than an estimate of the quantity 

max Q(x k ) - Q{x), (27) 

\\x— rr fc ||oo<l 

we conclude that the "upside" variation of the function Q deviates too much 
from its "downside" variation, and we choose the new radius Ak,e+i to bring 
these quantities more nearly into line. Our estimate of ( |27j ) is simply 

. * , [Q(s*)-™m(s M )]. 
mm(l,Z\ M ) 

that is, an extrapolation of the model reduction on the current trust region to 
a trust region of radius 1. Our complete strategy for reducing A is therefore as 
follows. (The counter is initialized to zero at the start of each major iteration.) 
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Procedure Reduce-Z\ 

evaluate 



^^^'' ff^'i ' (28) 



if p > 

counter <— counter+1; 
if p > 3 or (counter > 3 and p G (1, 3]) 

set 



^k,e+i = — r-7 — TT^Mi 
mm(p, 4) 

reset counter «— 0; 

This procedure is related to the technique of Kiwiel 0, p. 109] for increasing 
the coefficient of the quadratic penalty term in his regularized bundle method. 

If the test ( ^3| ) is passed, so that we have x k+1 — x k,t , we have a great deal 
of flexibility in defining the new model function m k+ i^. We require only that 
the properties (^2| ) are satisfied, with k + 1 replacing k. Hence, we are free to 
delete much of the optimality cut information accumulated at iteration k (and 
previous iterates). In practice, of course, it is wise to delete only those cuts that 
have been inactive for a substantial number of iterations; otherwise we run the 
risk that many new function and subgradient evaluations will be required to 
restore useful model information that was deleted prematurely. 

If the step to the new major iteration x k+1 shows a particularly close match 
between the true function Q and the model function rrik,e at the last minor iter- 
ation of iteration k, we consider increasing the trust-region radius. Specifically, 
if 

Q(x k - e )<Q(x k )~0.5(Q(x k )-m k ^x k - e )), \\x k - x^U = A k , e , (29) 
then we set 

A h+1 , = min(A hu 2A kA ), (30) 

where Z\hi is a prespecified upper bound on the radius. 

Before specifying the algorithm formally, we define the convergence test. 
Given a parameter etoi > 0, we terminate if 

Q(x k ) - m k , e (x k ' £ ) < etoi(l + |e(s fc )|). (31) 

Algorithm TR 

choose £ £ (0, 1/2), maximum trust region A^, tolerance e t oi; 
choose starting point x°; 

define initial model mo,o with the properties (p3) (for k = 0); 
choose A .o £ (0, A hi \; 
for fc = 0,1,2,... 

f inishedMinor Iteration <— false; 
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£ *— 0; counter <— 0; 
repeat 

solve ( |20| ) to obtain x fe,£ ; 
if ( |3l| ) is satisfied 

STOP with approximate solution 
evaluate function and subgradient at x ' , 
if ( p3| ) is satisfied 

set a;*^ 1 = x fe ^; 

obtain m^-i^o by possibly deleting cuts from mk,e , but 

retaining the properties ( p2| ) (with fc + 1 replacing fc); 
choose As+i.o G [As.^hi] according to (||), @; 
f inishedMinorlteration ^- true; 

else 

obtain m^+i from via Procedure Model-Update 
obtain Ak,e+i y i a Procedure Reduce- A; 

e*-£ + i; 

until f inishedMinorlteration 
end (for) 



3.2. Analysis of the Trust-Region Method 

We now describe the convergence properties of Algorithm TR. We show that for 
£toi = 0, the algorithm either terminates at a solution or generates a sequence of 
major iterates that approaches the solution set S (Theorem ||). When e to ] > 0, 
the algorithm terminates finitely; that is, it avoids generating infinite sequences 
either of major or minor iterates (Theorem ^). 

Given some starting point a; satisfying the constraints Ax° — b, x° > 0, 
and setting Qq — Q(x°), we define the following quantities that are useful in 
describing and analyzing the algorithm: 

C(Qo) = {x\Ax = b,x>0,Q{x) < Q }, (32) 
£(Q ; A) = {x\\\x- y\\ < A, for some y e C(Qo)}, (33) 
13 = sup{|| 5 ||i | g e dQ{x), for some x 6 C(Q ; A hi )}. (34) 

Using Assumption [l], we can easily show that (3 < oo. 

We start by showing that the optimal objective value for (20) cannot decrease 
from one minor iteration to the next. 

Lemma 1. Suppose that x k ' e does not satisfy the acceptance test (pffi). Then we 
have 

m kfl {x h ' 1 ) < m k ,e+i{x k ' e+1 ). 



Proof. In obtaining mk.i+i from rrik,i in Model-Update, we do not allow deletion 

\j] and 4i 



of cuts that were active at the solution x k ' e of (|2~l|). Using F?/ and /r fe / to denote 
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the active rows in FrA and /,.! , we have that x k ' e is also the solution of the 
following linear program (in which the inactive cuts are not present): 



mm 
x,0i,...,e 



T 

(Fx + 6j, subject to (35a) 

3=1 

9je > F$x + f*f , j = 1 , 2, . . . , T, (35b) 

Ax = b, x > 0, (35c) 

A k je < x - x k < A kii e. (35d) 



The subproblem to be solved for x k ' e+1 differs from ( p5| ) in two ways. First, 
additional rows may be added to Fy! and fy\> consisting of function values 

and subgradients obtained at x k,e and also inactive cuts carried over from the 
previous (|2l|). Second, the trust-region radius A ky(+1 may be smaller than A k j. 
Hence, the feasible region of the problem to be solved for x k,e+1 is a subset of 
the feasible region for (p3), so the optimal objective value cannot be smaller. 



Next we have a result about the amount of reduction in the model function 

Lemma 2. For all k = 0, 1, 2, . . . and i — 0, 1, 2, . . we have that 
m kA (x k ) - m k ,e(x k ' e ) = Q(x k ) - m k j(x k < e ) 

> min (A k £, \\x k - F(x fc )|U) (36a) 

\\x k - P{x")\\oc 

> emin (A k j, \\x k - P(i fc )||oo) , (36b) 
where e > is defined in (\ll 



Proof. The first equality follows immediately from y2q ), while the second in- 
equality ( |36bj ) follows immediately from (36a) and (|10|). We now prove (36a). 
Consider the following subproblem in the scalar r: 

min m M (x k +r[P{x k ) - x k }) subject to ||r[P(a; fc ) - x k ] jj^ < A k ,i- (37) 

Denoting the solution of this problem by T kl t, we have by comparison with ( p(j|) 
that 

m kl e(x k ' e ) < m k j (x k + T k/ [P(x k ) - x k ]) . (38) 
If r = 1 is feasible in ([57|), we have from ( |38|) and (|26| ) that 

m k j(x k - e ) < m k j (x k + T k j[P(x k ) ~ x k }) 
< m M (x k + [P(x k ) - x k }) = m Kl (P{x k )) < Q(P(x k )) = Q*. 

Therefore, when r = 1 is feasible for (|3^), we have from (|2^) that 

m k ,i(x k ) - m k/ (x k ' e ) > Q(x k ) - Q*, 
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so that (36a) holds in this case. 

When t = 1 is infeasible for (p7| ), consider setting r = A k ^/\\x k — P(x k )\\ oc 
(which is certainly feasible for (|37|)). We have from (|3g|), the definition of r k ^, 
the fact (p6|) that underestimates Q, and convexity of Q that 



../..' , / ./■ \ -P(a: fe ) - £ fc 



m k ,e( x ) < m M + Z\ M 



||F(x fc ) -x fc || c 

:.0(.» + 4u 1 



||P(x fe ) -X fc || c 



Therefore, using (p5|), we have 



i»m(**) - m M (*W) > p^^-ie^) - Q% 



verifying ( |36a| ) in this case as well. 

Our next result finds a lower bound on the trust-region radii A k j. For pur- 
poses of this result we define a quantity E k to measure the closest approach to 
the solution set for all iterates up to and including x k , that is, 

E k = min H^-P^lloo. (39) 

fe=0,l,...,fc 

Note that E k decreases monotonically with k. We also define Z\j n i t to be the 
initial value of the trust region. 

Lemma 3. There is a constant A\ > such that for all trust regions Af.,i used 
in the course of Algorithm TR, we have 

A kl e > mm(A lo ,E k /4). 

Proof. We prove the result by showing that the value A\ a = (1/4) min(l, Z\i n ;t, e/(3) 
has the desired property, where e is from (|lo| ) and (3 is from (|34| ). 
Suppose for contradiction that there are indices k and £ such that 

A k ,e < | min f 1, -|, Z\ init , E k 

Since the trust region can be reduced by at most a factor of 4 by Procedure 
Reduce- A, there must be an earlier trust region radius A^, i (with k < k) such 
that 

AM-<min(l,^£k) . (40) 
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and p > 1 in (p8|), that is, 

e(« W ) - Q(x k ) > . * . > (g(**) - ms^V 
01111(1,^-) V y 

= -r^(Q(^)-m M V^))- (41) 

By applying Lemma ||, and using (|4^) , we have 

Q(x k ) - ro*,^) > emin [A %h \\x k - P^)^) = eA~ k>I (42) 

where the last equality follows from \\x k — P(a; fe )|| 00 > Ej. > E^ and (f40|). By 
combining (Elf) with pi]), we have that 



Q{x 1 ' 1 ) - Q(x k ) > e. (43) 
By using standard properties of subgradients, we have 
Q{x k ' 1 ) - Q{x k ) <gj{x k ' l ~x k ) 

< \\g- e \\i\\x k ~ ^lloo < \\9i\\iAu, for all 9l e dQ(a^). (44) 



By combining this expression with (43), and using ( [40|) again, we obtain that 

Wli>^>A 

Js,I 



However, since x ' G C(Qo; A^i), we have from fl34| ) that < /?, giving a 

contradiction. 

Finite termination of the inner iterations is proved in the following two re- 
sults. Recall that the parameters £ and r\ are defined in ( p3|) and (|2~I|), respec- 
tively. 

Lemma 4. Let e t oi — in Algorithm TR, and let fj be any constant satisfying 
< fj < 1, fj > ?7 > r\. Let l\ be any index such that x k,e± fails to satisfy the 
test (pffi). Then either the sequence of inner iterations eventually yields a point 
x k ' 12 satisfying the acceptance test nBw, or there is an index £2 > i\ such that 



Q(x k ) - m M2 (x k - £ > ) < fj [Q{x k ) - m kiil {x k ^ )} . (45) 

Proof. Suppose for contradiction that the none of the minor iterations following 
£\ satisfies either (E3T) or the criterion (Ha); that is, 



Q{x k ) - m Kq {x k «) > fj [Q(x k ) - m Ml {x k ^)} , 

> V [Q(x k ) - m feA (a; Ml )] , for all q> £ x . (46) 

It follows from this bound, together with Lemma [j] and Procedure Model- 
Update, that none of the cuts generated at minor iterations q > £\ is deleted. 
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We assume in the remainder of the proof that q and £ are generic minor 
iteration indices that satisfy 

q> £>£ 1 . 

Because the function and subgradients from minor iterations x k ' , I = l\, Zi + 
1, . . . are retained throughout the major iteration k, we have 

m k , q (x k < e ) = Q(x k ' e ). (47) 

By definition of the subgradient, we have 

m k , q {x) - m k , q (x k ' e ) > g T {x - x k ' 1 ), for all g e dm k<q {x k4 ). (48) 

Therefore, from fl26| ) and (ffTj), it follows that 

Q{x) - Q(x k ' e ) > g T (x - x k ' e ), for all g G dm ktq {x k ' e ), 

so that 

dm Kq {x k ' 1 ) C aQ(x fe ^). (49) 

Since Q(x k ) < Q(x°) = Q , we have from M) that x k € £(Qo). Therefore, 
from the definition ( p3[ ) and the fact that \\x k ' 1 — x k \\ < A k j < Ahi, we have 
that x k ' e eC(Q ; A hi ). It follows from (f|) and @ that 

||<?||i</3, forall 3 e9m M (^ £ ). (50) 

Since is rejected by the test (|23|), we have from ([47j ) and Lemma |l| that 
the following inequalities hold: 

m ktq (xW) = Q{x k > 1 ) > Q(x k ) - £ [Q{x k ) - m^x^)] 

> Q(x k ) - i[Q{x k ) - m Ml . 

By rearranging this expression, we obtain 

Q(x k )-m k , q (x k ' e )<Z[Q(x k )-m kitl (x k ^)] . (51) 

Consider now all points x satisfying 

\\x-x k ' e \\oo < [Q{x k )-m k>ll {x k ^)] d ^ f C>0. (52) 



Using this bound together with (|48j) and (50), we obtain 

m ktq {x k ' 1 ) - m kjq (x) < g T {x k/ - x) 
< P\\x k >* - iHoo < (fj - [Q(x k ) - m kA {x k ^)] . 

By combining this bound with (J5l| ), we find that the following bound is satisfied 
for all x in the neighborhood (|52|): 

Q(x k ) - m k>q (x) = [Q(x k ) - m Kq {x k ^)] + [m k , q (x k ' e ) - m k , q (x)] 
<v[Q{x k )~m kM {x k ^)] . 
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It follows from this bound, in conjunction with (|46|), that x k ' q (the solution of the 
trust-region problem with model function mk,q) cannot lie in the neighborhood 
(|52|). Therefore, we have 

||^-^|| 00 >C (53) 

But since \\x k ' e — x fe || 00 < A k < A^ for all i > i\, it is impossible for an infinite 
sequence {x k ' i }i>i 1 to satisfy (|55|). We conclude that ( fi5|) must hold for some 
(-2 >(■!■> as claimed. 

We now show that the minor iteration sequence terminates at a point x ' 
satisfying the acceptance test, provided that x k is not a solution. 

Theorem 1. Suppose that e to \ = 0. 

(i) If x k <f_ S, there is an I > such that x k ' e satisfies (fj^. 
(ii) If x k 6 S, then either Algorithm TR terminates (and verifies that x k G S), 
or Q(x k ) - to m (ie m ) 1 0. 

Proof. Suppose for the moment that the inner iteration sequence is infinite, that 
is, the test ( [23|) always fails. By applying Lemma ^] recursively, with any constant 
f) satisfying the properties stated in Lemma |ij we can identify a sequence of 
indices < l\ < I2 < ■ ■ . such that 

Q(x k ) - m Mj (x k & ) < fj [Q(x k ) - mk^-t (x^- 1 )] 
<fj 2 [Q{x k )-m k , ti _ 2 {x h > l i-*)] 



<ff [Q(x k )-m kfi (x k ' )]. (54) 
When x k ^ S, we have from Lemma || that 

A kJl > min(Ao, E k /4) d = A lo > 0, for all £ = 0,1,2,..., 



so the right-hand side of (|36aJ) is strictly positive. Hence for j sufficiently large, 
we have that 

Q(x k ) - m k , tj {x k ^) < 0.5 min (A , \\x k - P(^)|U) S^S, ■ 

\\X K - P{x K )\\oo 

But this inequality contradicts ([36]), proving (i). 

For the case of x k G S, there are two possibilities. If the inner iteration 
sequence terminates finitely at some x k ' 1 , we have Q(x k ) — m k ^{x k ' ) = and 
indeed that 

m k ,t(x) > Q{x k ) = Q*, for all x with \\x - x k \\ oc < A kii . 



Because of (f26[), we have that Q(x) > Q{x k ) for all a; in a neighborhood of x , 
implying that € dQ(x k ). Therefore, termination under these circumstances 
yields a guarantee that x k 6 S. When the algorithm does not terminate, it 
follows from (|4|) that Q(x k ) — m k n(x k,t ) — > 0. By applying Lemma [l], we verify 
our claim (ii) of monotonic convergence. 
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We now prove convergence of Algorithm TR to S. 

Theorem 2. Suppose that etoi = 0. The sequence of major iterations {x k } is 
either finite, terminating at some x k S S, or is infinite, with the property that 
||i fc -P(a;*)||oo-0. 

Proof. If the claim does not hold, there are two possibilities. The first is that 
the sequence of major iterations terminates finitely at some x k £ S. However, 
Theorem ^ ensures, however, that the minor iteration sequence will terminate 
at some new major iteration x k+1 under these circumstances, so we can rule out 
this possibility. The second possibility is that the sequence {x k } is infinite but 
that there is some e > and an infinite subsequence of indices {fcj}j=i.2,... such 
that 

11^- -P^Wc >e, j =0,1,2,.... 

Since the sequence {Q(x kj )}j=i,2,... is infinite, decreasing, and bounded below, 
it converges to some value Q > Q*. Moreover, since the entire sequence {Q(x k )} 
is monotone decreasing, it follows that Q(x k ) > Q and therefore 

Q(x k )~Q* > Q-Q* >0, k = 0,1,2,.... 

Hence, by boundedness of the subgradients (see ([34])), we can identify a constant 
e > such that 

||x fc -P(x fc )|U >e, k = 0,1,2,.... 
It follows from ( |39| ) that 

E k >e, k = 0,1,2,.... (55) 

For each major iteration index fc, let £(k) be the minor iteration index that 
passes the acceptance test (|2^) . By combining (^3|) with Lemma 0, we have that 

Q(x k ) - Q(x k+1 ) > Cemin (A k<m , \\x k - P^U) > Semin (A kAk) ,e) . 

Since Q(x k ) - Q(x k+1 ) — > 0, we deduce that 

lim A ki r k ) = 0. (56) 

k — >oo 

By Lemma || and (^5|) , we have 

A km > min(Ao, e/4) > 0, k = 0, 1, 2, . . . , 

which contradicts (|56|). We conclude that the second possibility (an infinite se- 
quence {x k } not converging to iS) cannot occur either, so the proof is complete. 

Finally, we show that the algorithm terminates when e to i > 0. 
Theorem 3. When e to i > 0, Algorithm TR terminates finitely. 
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Proof. We show first that the algorithm cannot "get stuck" at a particular x k , 
generating an infinite sequence of minor iterations at x k without eventually 
satisfying either (|l|) or the acceptance test (p3|). We see from the reasoning in 
the proof of Theorem ^ together with the monotonicity property of Lemma ^ 
that an infinite sequence of minor iterations must satisfy that 

Q{x k )-m kl {x k ' 1 ) 10. (57) 

Since the right-hand side of ( |3l| ) is bounded below by e to i, the test ( |3l| ) must be 
satisfied for some I. Therefore, the minor iteration sequence cannot be infinite. 

Now consider the other possibility of an infinite sequence of major iterations 
{x k }k=i,2....- Since we have 

Q(x fc ) - m M (x M ) > e t oi 

for all k and £, and since the acceptance test ( p3| ) is satisfied at all k, we have 

Q(x k ) - Q{x k+1 ) > £e to l > 0, for all k = 0, 1, 2 . . .. 

But this relation is inconsistent with the fact that {Q(x k )} is bounded below 
(by Q*), so this possibility can also be ruled out, and the proof is complete. 



3.3. Discussion 

The algorithm can be modified in various ways without changing its properties 
greatly. For instance, we could replace the step norm bound in (j2fj) by a scaled 
bound of the form 

\\S{x - x^U < A k , 

where S is a diagonal positive definite matrix. After this modification, (|TJ) re- 
mains a linear program. We could also use a 1-norm trust region, at the cost of 
introducing an additional variable vector s of the same dimension as x. Specifi- 
cally, we enforce the constraint \\x — x k \\i < A k by enforcing the following linear 
constraints: 

x — x k < s, x k - x < s, e T s < A k . 

Once again, we obtain a linear programming subproblem, albeit one that involves 
more variables than ( |2l| ) 

If a 2-norm trust region is used, we can show by comparing the optimality 
conditions for the respective problems that the solution of the subproblem 

min mfe i(x) subject to Ax — b, x > 0, \\x — x k |U < A k 

is identical to the solution of 

min m,k,i{x) + X\\x — x k \\ 2 subject to Ax — b, x > 0, (58) 



for some A > 0. We can transform (58) to a quadratic program in the same 
fashion as the transformation of ( pp| ) to (|2l|) . The bundle-trust-region approaches 
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described in Kiwiel m& , Hirart-Urruty and Lemarechal p4| Chapter XV] , and 



Ruszczynski [£1 22 also lead to problems of the form (pq). These approaches 
manipulate the parameter A rather than adjusting the trust-region radius, more 
in the spirit of the Lcvcnberg-Marquardt method for least-squares problems than 
of a true trust-region method. Hence, their analysis differs somewhat from that 
of the preceding section. Moreover, although quadratic programming solvers that 
exploit the special structure of the quadratic term in ( |58| ) have been designed and 
implemented (see ) , we believe that the linear programming subproblem ( pl| ) 
is more appealing from a practical point of view. Improvements in the efficiency 
and ease of use of linear programming software have continued to occur at a 
rapid pace, and availability of high-quality software has made it much easier to 
implement an efficient algorithm based on (^lj) than would have been the case 
if the subproblems had the form (|58|). 



4. An Asynchronous Bundle-Trust-Region Method 



In this section we present an asynchronous, parallel version of the trust-region 
algorithm of the preceding section and analyze its convergence properties. 



4.1. Algorithm ATR 

We now define a variant of the method of Section || that allows the partial sums 
Qui , j = 1,2, ... ,T (fill ) and their associated cuts to be evaluated simultaneously 
for different values of x. We generate candidate iterates by solving trust-region 
subproblems centered on an "incumbent" iterate, which (after a startup phase) 
is the point x 1 that, roughly speaking, is the best among those visited by the 
algorithm whose function value Q{x) is fully known. 

By performing evaluations of Q at different points concurrently, we relax the 
strict synchronicity requirements of Algorithm TR, which requires Q(x k ) to be 
evaluated fully before the next candidate x k+1 is generated. The resulting ap- 
proach, which we call Algorithm ATR (for "asynchronous TR"), is more suitable 
for implementation on computational grids of the type we consider here. Besides 
the obvious increase in parallelism that goes with evaluating several points at 
once, there is no longer a risk of the entire computation being help up by the slow 
evaluation of one of the partial sums Quj on a recalcitrant worker. Algorithm 
ATR has similar theoretical properties to Algorithm TR, since the mechanisms 
for accepting a point as the new incumbent, adjusting the size of the trust region, 
and adding and deleting cuts are all similar to the corresponding mechanisms in 
Algorithm TR. 

Algorithm ATR maintains a "basket" B of at most K points for which the 
value of Q and associated subgradient information is partially known. When the 
evaluation of Q(x q ) is completed for a particular point x q in the basket, it is 
installed as the new incumbent if (i) its objective value is smaller than that of 
the current incumbent x 1 ; and (ii) it passes a trust-region acceptance test like 
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(p3|), with the incumbent at the time x q was generated playing the role of the 
previous major iteration in Algorithm TR. Whether x q becomes the incumbent 
or not, it is removed from the basket. 

When a vacancy arises in the basket, we may generate a new point by solv- 
ing a trust-region subproblem similar to (^0|), centering the trust region at the 
current incumbent x 1 . During the startup phase, while the basket is being popu- 
lated, we wait until the evaluation of some other point in the basket has reached 
a certain level of completion (that is, until a proportion a 6 (0, 1] of the par- 
tial sums ( pi] ) and their subgradients have been evaluated) before generating a 
new point. We use a logical variable speceval^ to indicate when the evaluation 
of x q passes the specified threshold and to ensure that x q does not trigger the 
evaluation of more than one new iterate. (Both a and speceval^ play a similar 
role in Algorithm ALS.) After the startup phase is complete (that is, after the 
basket has been filled), vacancies arise only after evaluation of an iterate x q is 
completed. 

We use m(-) (without subscripts) to denote the model function for Q(-). 
When generating a new iterate, we use whatever cuts are stored at the time to 
define m. When solved around the incumbent x 1 with trust-region radius A, the 
subproblem is as follows: 

trsub(x 7 , A) : min m{x) subject to Ax = b, x > 0, — x \\oo < A. (59) 

X 

We refer to x l as the 'parent incumbent of the solution of ( |59"| ) . 

In the following description, we use k to index the successive points x k that 
are explored by the algorithm, / to denote the index of the incumbent, and B 
to denote the basket. We use tk to count the number of partial sums Q[j](x k ), 
j = 1, 2, . . . , T that have been evaluated so far. 

Given a starting guess a; , we initialize the algorithm by setting the dummy 
point x~ x to x° , setting the incumbent index / to —1, and setting the initial 
incumbent value Q 1 — Q^ 1 to oo. The iterate at which the first evaluation is 
completed becomes the first "serious" incumbent. 

We now outline some other notation used in specifying Algorithm ATR: 

Q} : The objective value of the incumbent x 1 , except in the case of / = —1, in 

which case <2 _1 ~ oo. 
I q : The index of the parent incumbent of x q , that is, the incumbent index I 
at the time that x q was generated from (|59|). Hence, Q Iq ~ Q(x Iq ) (except 
when I q = —1; see previous item). 
A q : The value of the trust-region radius A used when solving for x q . 
^cun-: Current value of the trust-region radius. When it comes time to solve ([59]) 
to obtain a new iterate x q , we set A q <— Z\ C urr- 
m q : The optimal value of the objective function m in the subproblem trsut^x 7 ', A q ) 

©• 

Our strategy for maintaining the model closely follows that of Algorithm TR. 
Whenever the incumbent changes, we have a fairly free hand in deleting the cuts 
that define m, just as we do after accepting a new major iterate in Algorithm TR. 



Stochastic Programming on a Computational Grid 



23 



If the incumbent does not change for a long sequence of iterations (corresponding 
to a long sequence of minor iterations in Algorithm TR), we can still delete 
"stale" cuts that represent information in m that has likely been superseded (as 
quantified by a parameter r\ £ [0, 1)). The following version of Procedure Model- 
Update, which applies to Algorithm ATR, takes as an argument the index k of 
the latest iterate generated by the algorithm. It is called after the evaluation of 
Q at an earlier iterate x q has just been completed, but x q does not meet the 
conditions needed to become the new incumbent. 

Procedure Model-Update (fc) 
for each optimality cut defining m 
possible_delete <— true; 

if the cut was generated at the parent incumbent Ik of k 

possible_delete <— false; 
else if the cut was active at the solution x k of trsub(x /fc , Ak) 

possible_delete <— false; 
else if the cut was generated at an earlier iteration I 
such that Ij = Ik 7^ — 1 and 

Q Ik - m k > V [Q Ik - m l \ (60) 

possible_delete <— false; 
end (if) 

if possible_delete 

possibly delete the cut; 
end (for each) 

Our strategy for adjusting the trust region Z\ curr also follows that of Algo- 
rithm TR. The differences arise from the fact that between the time an iterate 
x q is generated and its function value Q(x q ) becomes known, other adjustments 
of Zicurront may have occurred, as the evaluation of intervening iterates is com- 
pleted. The version of Procedure Reduce- A for Algorithm ATR is as follows. 

Procedure Reduce- A(q) 
ifj, = -l 

return; 
evaluate 



min(l,A,) ^_ m i ; (61) 



if p > 

counter <— counter+1; 
if p > 3 or (counter > 3 and p £ (1, 3]) 

set A+ <- Z\,/min(p,4); 
set Z\ curr <— min(Z\ curr ,Z\+); 
reset counter <— 0; 
return. 
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The protocol for increasing the trust region after a successful step is based 
on (pg|), (j30|). If on completion of evaluation of Q(x q ), the iterate x q becomes 
the new incumbent, then we test the following condition: 

Q(x q ) < Q Iq - 0.5(Q J « - m q ) and \\x q - x 1 * IU = A q . (62) 

If this condition is satisfied, we set 

A cuir «- max(Z\ curr , min(Z\ h i, 2/A g )). (63) 

The convergence test is also similar to the test (^) used for Algorithm TR. 
We terminate if, on generation of a new iterate x k , we find that 

Q 1 - m k < e to i(l + |Q J |). (64) 

We now specify the four key routines of the Algorithm ATR, which serve a 
similar function to the four main routines of Algorithm ALS. As in the earlier 
case, we assume for simplicity of description that each task consists of evaluation 
of the function and a subgradient for a single cluster (although in practice we may 
bundle more than one cluster into a single task). The routine partial_evaluate 
executes on worker processors, while the other three routines execute on the 
master processor. 

ATR: partial_evaluate(x 9 , j, Q[j](x q ), gj) 

Given cc 9 , index q, and partition number j, evaluate Q\j](x q ) from ( |ll"| ) 

together with a partial subgradient gj from (|l3|); 
Activate act_on_completed_task(a; 9 , q, j, Q\j](x q ), gj) on the master processor. 



ATR: evaluate (.t 9 , q) 

for j = 1, 2, . . . , T (possibly concurrently) 

partial_evaluate(a; 9 , q, j, Qyj (x q ),gj); 
end (for) 



ATR: initialization^ ) 

choose £ 6 (0, 1/2), trust region upper bound Z\hi > 0; 

choose synchronicity parameter a G (0, 1]; 

choose maximum basket size K > 0; 

choose A cun . G (0, ^hi]j counter <— 0; £ <— 0; 

/ 4- -1; x- 1 4- x°; Q- 1 <- oo; I <- -1; 

fc <— 0; speceval <— false; to *~ 0; 

evaluate(a; , 0). 
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ATR: act_on_completed_task(a;' ? , q,j, Qui (x q ), gj)) 

tq * tq -\- 1, 

add Q[j](x q ) and cut gj to the model m; 
basketFill <— false; basketUpdate <— false; 
if t g = T (* evaluation of Q(x q ) is complete *) 

if Q(x q ) < Q! and (J, = -1 or Q(^) < Q J * - £(Q Z * - m 9 )) 

(* make x q the new incumbent *) 

/ - q; Q 1 - Q(^); 

possibly increase zi curr according to (|6|) and (p3[); 
modify the model function by possibly deleting cuts not arising 
from the evaluation of Q(x q ); 

else 

call Model-Update(fc); 

call Reduce- A(q) to update Z\ curr ; 
end (if) 
B^B\{q}; 

basketUpdate <— true; 
else if t q > <tT and |B| < if and not speceval ? 

(* basket-filling phase: enough partial sums have been evaluated at x q 

to trigger calculation of a new candidate iterate *) 
speceval 9 ^true; basketFill <— true; 
end (if) 

if basketFill or basketUpdate 

k <— k + 1; set A k *- Z\ curr ; set I k <- 7; 
solve trsub(x / , /!&) to obtain cc fc ; 
m <— m(x k ); 
if © holds 
STOP; 
B^BU{k}; 

speceval fc <— false; tj. <— 0; 
evaluate(a; , fc); 
end (if) 

It is not generally true that the first K iterates a; , x 1 , ... , x K ~ Y generated by 
the algorithm are all basket-filling iterates. Often, an evaluation of some iterate is 
completed before the basket has filled completely, so a "basket-update" iterate is 
used to generate a replacement for this point. Since each basket-update iterate 
does not change the size of the basket, however, the number of basket-filling 
iterates that are generated in the course of the algorithm is exactly K. 

4.2. Analysis of Algorithm ATR 

We now analyze Algorithm ATR, showing that its convergence properties are 
similar to those of Algorithm TR. Throughout, we make the following assump- 
tion: 

Every task is completed after a finite time. (65) 
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The analysis follows closely that of Algorithm TR presented in Section 3.2. 
We state the analogues of all the lemmas and theorems from the earlier section, 
incorporating the changes and redefinitions needed to handle Algorithm ATR. 
Most of the details of the proofs are omitted, however, since they are similar to 
those of the earlier results. 

We start by defining the level set within which the points and incumbents 
generated by ATR lie. 

Lemma 5. All incumbents x 1 generated by ATR lie in £(Q max ), whereas all 
points x k considered by the algorithm lie in £(Q max ; A^), where £(•) and £(•; •) 
are defined by M^ ) and (jffj^, respectively, and <2 max is defined by 

Qmax = sup{Q(z) | \\x ~ x° \\ < A hi } . 

Proof. Consider first what happens in ATR before the first function evaluation 
is complete. Up to this point, all the iterates x k in the basket are generated 
in the basket-filling part and therefore satisfy \\x k — x°\\ < Ak < A\ li: with 
Q Ih = Or 1 = oo. 

When the first evaluation is completed (by x k , say), it trivially passes the 
test to be accepted as the new incumbent. Hence, the first noninfinite incumbent 
value becomes Q 1 = Q(x k ), and by definition we have Q 1 < Q max . Since all later 
incumbents must have objective values smaller than this first Q 1 , they all must 
lie in the level set £(Q max ), proving our first statement. 

All points x k generated within act_on_completed_task lie within a distance 
Ak < A^i either of x° or of one of the later incumbents x 1 . Since all the incum- 
bents, including x°, lie in £(Q max ), we conclude that the second claim in the 
theorem is also true. 

Analogously with (3 , we define a bound on the subgradients over the set 
£(Q max ; Aii) as follows: 

(3 = sup{||g||i \ g G dQ(x), for some x G C(Q max ; A bi )} . (66) 

The next result is analogous to Lemma 0. It shows that for any sequence of 
iterates x k for which the parent incumbent xr k is the same, the optimal objective 
value in trsub(a; /fc , Ak) is monotonically increasing. 

Lemma 6. Consider any contiguous subsequence of iterates x k , k — ki,ki + 
l,...,/c2 for which the parent incumbent is identical; that is, Ik x — Ik t +i = 
• • • = Ik 2 - Then we have 

m kl < m kl+1 < ■■■ <m k2 . 

Proof. We select any k = hi, ki + 1, . . . , fe — 1 and prove that m k < m k+1 . Since 
x k and x k+1 have the same parent incumbent (x 1 , say), no new incumbent has 
been accepted between the generation of these two iterates, so the wholesale cut 
deletion that may occur with the adoption of a new incumbent cannot have oc- 
curred. There may, however, have been a call to Model-Update(fc). The first "else 
if" clause in Model-Update would have ensured that cuts active at the solution 
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of trsub(x / , Ak) were still present in the model when we solved trsub(a; / , Ak+i) 
to obtain x k+1 . Moreover, since no new incumbent was accepted, Z\ C urr cannot 
have been increased, and we have A k +\ < Ak- We now use the same argument 
as in the proof of Lemma |l| to deduce that m k < m k+1 . 

The following result is analogous to Lemma We omit the proof, which 
modulo the change in notation is identical to the earlier result. 

Lemma 7. For all k = 0, 1, 2, . . . such that I k ^ —1, we have that 

Q 1 * -m k > mm (A k , \\x^ P^)^) (67a) 
> emm(A k ,\\x Ik - P(a; 7fc )|| 00 ) , (67b) 
where e > is defined in Alt 



The following analogue of Lemma |3| requires a slight redefinition of the quan- 
tity Ej~ from (|3J]). We now define it to be the closest approach by an incumbent 
to the solution set, up to and including iteration fc; that is, 

£*= £ min Wx 1 - -P(^)||oo. (68) 

/c=0 : l,...,fc;Jfc#-l 

We also omit the proof of the following result, which, allowing for the change of 
notation, is almost identical to that of Lemma [j| 

Lemma 8. There is a constant A\ a > such that for all trust regions A k used 
in the course of Algorithm ATR, we have 

A k > min(Ao,Pfe/4). 

The value of A\ Q that works in this case is A\ Q = (1/4) min(l, e//3, Z\hi), where 
/3 comes from (|66|). 

There is also an analogue of Lemma ^ that shows that if the incumbent re- 
mains the same for a number of consecutive iterations, the gap between incum- 
bent objective value and model function decreases significantly as the iterations 
proceed. 

Lemma 9. Let etoi = in Algorithm ATR, and let fj be any constant satisfying 
< fj < I, fj > fj > rj. Choosing any index k\ with Ik x ^ — 1, we have either 
that the incumbent Ik ± = I is eventually replaced by a new incumbent or that 
there is an iteration k,2 > k\ such that 

Q 1 - m k2 < fj [Q 1 - m kl ] . (69) 

The proof of this result follows closely that of its antecedent Lemma [|. The key 
is in the construction of the Model-Update procedure. As long as 

Q 1 -m k > rjlQ 1 - m kl ], for k > ki, where I = I kl = I k , (70) 
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none of the cuts generated during the evaluation of Q(x q ) for any q = k±, k% + 

1. . . . , k can be deleted. The proof technique of Lemma |^ can then be used to 
show that the successive iterates x kl , x kl+1 , . . . cannot be too closely spaced if 
the condition ([?0|) is to hold and if all of them fail to satisfy the test to become 
a new incumbent. Since they all belong to a box of finite size centered on x 1 , 
there can be only finitely many of these iterates. Hence, either a new incumbent 
is adopted at some iteration k > k\ or condition (|6^) is eventually satisfied. 

We now show that the algorithm cannot "get stuck" at a nonoptimal incum- 
bent. The following result is analogous to Theorem [j], and its proof relies on the 
earlier results in exactly the same way. 

Theorem 4. Suppose that e to \ = 0. 

(i) If x 1 £S, then this incumbent is replaced by a new incumbent after a finite 
time. 

(ii) If x 1 G S, then either Algorithm ATR terminates (and verifies that x 1 £ S), 
or Q 1 — m k | as k — » oo. 

We conclude with the result that shows convergence of the sequence of in- 
cumbents to S. Once again, the logic of proof follows that of the synchronous 
analogue Theorem ^| 

Theorem 5. Suppose that e to \ = 0. The sequence of incumbents {x Ik }fe=o,i,2,... 
is either finite, terminating at some x 1 6 S or is infinite with the property that 
||a^* - -P(a: J *)I|oo 

5. Implementation on Computational Grids 

We now describe some salient properties of the computational environment in 
which we implemented the algorithms, namely, a computational grid running 
the Condor system and the MW runtime support library. 

5.1. Properties of Grids 

The term "grid computing" (synonymously "metacomputing" ) is generally used 
to describe parallel computations on a geographically distributed, heterogeneous 
computing platform. Within this framework there are several variants of the con- 
cept. The one of interest here is a parallel platform made up of shared worksta- 
tions, nodes of PC clusters, and supercomputers. Although such platforms are 
potentially powerful and inexpensive, they are difficult to harness for productive 
use, for the following reasons: 

— Poor communications properties. Latencies between the processors may be 
high, variable, and unpredictable. 

— Unreliability. Resources may disappear without notice. A workstation per- 
forming part of our computation may be reclaimed by its owner and our job 
terminated. 
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— Dynamic availability. The pool of available processors grows and shrinks 
during the computation, according to the claims of other users and scheduling 
considerations at some of the nodes. 

— Heterogeneity. Resources may vary in their operational characteristics (mem- 
ory, swap space, processor speed, operating system). 

In all these respects, our target platform differs from conventional multiprocessor 
platforms (such as IBM SP or SGI Origin machines) and from Linux clusters. 

5.2. Condor 

Our particular interest is in grid computing platforms based on the Condor sys- 
tem Jl7| , which manages distributively owned collections ( "pools" ) of processors 
of different types, including workstations, nodes from PC clusters, and nodes 
from conventional multiprocessor platforms. When a user submits a job, the 
Condor system discovers a suitable processor for the job in the pool, transfers 
the executable and starts the job on that processor. It traps system calls (such 
as input/output operations), referring them back to the submitting workstation, 
and checkpoints the state of the job periodically. It also migrates the job to a 
different processor in the pool if the current host becomes unavailable for any 
reason (for example, if the workstation is reclaimed by its owner). Condor man- 
aged processes can communicate through a Condor-enabled version of PVM Q 
or by using Condor's I/O trapping to write into and read from a series of shared 
files. 

5.3. Implementation in MW 

MW (see Goux, Linderoth, and Yoder |l3| and Goux et al. |l^| ) is a runtime 
support library that facilitates implementation of parallel master- worker applica- 
tions on computational grids. To implement MW on a particular computational 
grid, a grid programmer must reimplement a small number of functions to per- 
form basic operations for communications between processors and management 
of computational resources. These functions are encapsulated in the MWRM- 
Comm class. Of more relevance to the current paper is the other side of MW, 
the application programming interface presented to the application programmer. 
This interface takes the form of a set of three C++ abstract classes that must be 
reimplemented in a way that describes the particular application. These classes, 
named MWDriver, MWTask, and MWWorker, contain a total of ten methods 
for which the user must supply implementations. We describe these methods 
briefly, indicating how they are implemented for the particular case of the ATR 
and ALS algorithms. 

MWDriver. This class is made up of methods that execute on the submitting 
workstation, which acts as the master processor. It contains the following four 
CH — h pure virtual functions. (Naturally, other methods can be defined as needed 
to implement parts of the algorithm.) 
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— get_userinf o: Processes command-line arguments and does basic setup. In 
our applications this function reads a command file to set various parame- 
ters, including convergence tolerances, number of scenarios, number of partial 
sums to be evaluated in each task, maximum number of worker processors to 
be requested, initial trust region radius, and so on. It calls the routines that 
read and store the problem data files, and it reads the initial point, if one 
is supplied. It also performs the operations specified in the initialization 
routine of Algorithms ALS and ATR, except for the final evaluate operation, 
which is handled by the next function. 

— setup_initial_tasks: Defines the initial pool of tasks. In the case of Al- 
gorithms ALS and ATR, this function corresponds to a call to evaluate at 
x°. 

— pack_worker_init_data: Packs the initial data to be sent to each worker 
processor when it joins the pool. In our case, the information contained in 
the input files for the stochastic programming problem is sent to each worker. 
When the worker subsequently receives a task requiring it to solve a number 
of second-stage scenarios, it can use the original input data to generate the 
particular data for its assigned set of scenarios. By loading each new worker 
with the problem data, we avoid having to subsequently pass a complete set 
of data for every scenario in every task. 

— act_on_completed_task: Is called every time a task finishes, to process the 
results of the task and to take any actions arising from these results. See 
Algorithms ALS and ATR for our definition of this function in our applica- 
tions. 

The MWDriver base class performs many other operations associated with 
handling worker processes that join and leave the computation, assigning tasks 
to appropriate workers, rescheduling tasks when their host workers disappear 
without warning, and keeping track of performance data for the run. All this 
complexity is hidden from the application programmer. 

MWTask. The MWTask is the abstraction of a single task. It holds both the 
data describing that task and the results obtained by executing the task. The 
user must implement four functions for packing and unpacking this data and 
results between master and workers into simple data structures that can be 
communicated between master and workers using the appropriate primitives 
for the particular computational grid platform on which MW is implemented. In 
most of the results reported in Section the message-passing facilities of Condor- 
PVM were used to perform the communication. By simply changing compiler 
directives, the same algorithmic code can also be implemented on an alternative 
communication protocol that uses shared files to pass messages between master 
and workers. The large run reported in the next section used this version of the 
code. 

In our applications, each task evaluates the partial sum Qui (x) and a subgra- 
dient for a given number of clusters. The task is described by a range of scenario 
indices for each cluster in the task and by a value of the first-stage variables x. 
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The results consist of the function and subgradient for each of the clusters in 
the task. 

MWWorker. The MWWorker class is the core of the executable that runs on 
each worker. The user must implement two pure virtual functions: 

— unpack_init_data: Unpacks the initial information passed to the worker by 
the MWDriver function pack_worker_init_data() when the worker joins 
the pool. (See the discussion of pack_worker_init_data in the MWDriver 
class.) 

— execute_task: Executes a single task. 

After initializing itself, using the information passed to it by the master, the 
worker process sits in a loop, waiting for tasks to be sent to it. When it detects 
a new task, it calls execute_task to compute the results. It passes the results 
back to the worker by using the appropriate function from the MWTask class, 
and then returns to its wait loop. The wait loop terminates when the master 
sends a termination message. In our applications, the execute_task() function 
formulates the second-stage linear programs in its clusters by using the informa- 
tion in the task definition and the data passed to the worker on initialization. 
It then calls the linear programming solvers SOPLEX or CPLEX to solve these 
linear programs, and uses the dual solutions to calculate the subgradient for each 
cluster. 

6. Computational Results 

We now report on computational experiments obtained with implementations 
of the ALS, TR, and ATR algorithms using MW on the Condor system. After 
describing some further details of the implementations and the experiments, we 
discuss our choices for the various algorithmic parameters and how these were 
varied between runs. We then tabulate and discuss the results. 

6.1. Implementations and Experiments 

As noted earlier, we used the Condor-PVM implementation of MW for most 
of the the runs reported here. Most of the computational time is taken up 
with solving linear programming problems, both by the master process (in 
the act_on_completed_task function) and in the tasks, which solve clusters of 
second-stage linear programs. We used the CPLEX simplex solver on the master 
processor and the SOPLEX public-domain simplex code (see Wunderling p6[ ) 
on the workers. SOPLEX is somewhat slower in general, but since most of the 
machines in the Condor pool do not have CPLEX licenses, there was little al- 
ternative but to use a public-domain code. 

We ran most of our experiments on the Condor pool at the University of 
Wisconsin, sometimes using Condor's flocking mechanism to augment this pool 
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with processors from other sites. The other sites included the University of New 
Mexico, Columbia University, and the Linux cluster Chiba City at Argonne Na- 
tional Laboratory. The architectures included PCs running Linux, and PCs and 
Sun workstations running different versions of Solaris. The number of workers 
available for our use varied dramatically between and during each set of trials, 
because of the differing priorities of the two accounts we used, the variation of 
our priority during each run, the number and priorities of other users of the 
Condor pool at the time, and the varying number of machines available to the 
pool. The latter number tends to be larger during the night, when owners of the 
individual workstations are less likely to be using them. The master process was 
run on a Linux machine in some experiments and an Intel Solaris machine in 
other cases. 

The input files for the problems reported here were in SMPS format (see Birge 
et al. [[| and Gassmann and Schweitzer We considered two-stage stochastic 
linear programs in which the number of scenarios is finite but extremely large. 
We used Monte Carlo sampling to obtain approximate problems with a specified 
number N of second-stage scenarios. Brief descriptions of the test problems can 
be found at |T^| . In each experiment, we supplied a starting point to the code, 
obtained from the solution of a different sampled instance of the same problem. 
The function value of the starting point was therefore quite close to the optimal 
objective value. 

6.2. Critical Parameters 

As part of the initialization procedure (implemented by the get_userinf o func- 
tion in the MWDriver class), the code reads an input file in which various param- 
eters are specified. Several parameters, such as those associated with modifying 
the size of the trust region, have fixed values that we have discussed already in 
the text. Others are assigned the same values for all algorithms and all experi- 
ments, namely, 

e tol = l(T 5 , Z\ hi = 10 3 , A), = A> = 1, e = l(T 4 . 

We also set 77 = in the Model-Update functions in both TR and ATR. In TR, 
this choice has the effect of not allowing deletion of cuts generated during any 
major iterations, until a new major iterate is accepted. In ATR, the effect is to 
not allow deletion of cuts that are generated at points whose parent incumbent 
is still the incumbent. Even among cuts for which possible_delete is still true 
at the final conditional statement of the Model-Update procedures, we do not 
actually delete the cuts until they have been inactive at the solution of the 
trust-region subproblem for a specified number of consecutive iterations. For TR, 
we delete the cut if it has been inactive for more than 100 consecutive minor 
iterations, while in ATR we delete the cut if it was last active at subproblem £, 
where I < k — 100 and k is the current iteration index. Our cut deletion strategy 
is therefore not at all parsimonious; it tends to lead to subproblems ( p0| ) and ( p9| ) 
with fairly large numbers of cuts. In most cases, however, the storage required 
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for these cuts and the time required to solve the subproblems remain reasonable. 
We discuss the exceptions below. 

The synchronicity parameter cr, which arises in Algorithms ALS and ATR 
and which specifies the proportion of clusters from a particular point that must 
be evaluated in order to trigger evaluation of a new candidate solution, is varied 
between .5 and 1.0 in our experiments. The size K of the basket B is varied 
between 1 and 14. For each problem, the number T of clusters is also varied in a 
manner described in the tables, as is the number of tasks into which the second- 
stage calculations are divided, which we denote by C. Note that the number of 
second-stage LPs per chunk is therefore N/C while the number per cluster is 
N/T. 

The MW library allows us to specify an upper bound on the number of 
workers we request from the Condor pool, so that we can avoid claiming more 
workers than we can utilize effectively. We calculate a rough estimate of this 
number based on the number of tasks C per evaluation of Q{x) and the basket 
size K. For instance, the synchronous TR and LS algorithms can never use more 
than C worker processors, since they evaluate Q at just one a; at a time. In the 
case of TR and ATR, we request mid(25, 200, [(K + l)C/2j ) workers. For ALS, 
we request mid(25, 200, 2C) workers. 

We have a single code that implements all four algorithms LS, ALS, TR, and 
ATR, using logical branches within the code to distinguish between the L-shaped 
and trust-region variants. There is no distinction in the code between the two 
synchronous variants and their asynchronous counterparts. Instead, by setting 
a = 1.0, we force synchronicity by ensuring that the algorithm considers only 
one value of a; at a time. 

Whenever a worker processor joins the computation, MW sends it a bench- 
mark task that typifies the type of task it will receive during the run. In our 
case, we define the benchmark task to be the solution of N/C second-stage LPs. 
The time required for the processor to solve this task is logged, and we set the 
ordering policy so as to ensure that when more than one worker is available to 
process a particular task, the task is sent to the worker that logged the fastest 
time on the benchmark task. 

6.3. Results: Varying Parameter Choices 

In this section we describe a series of experiments on the same problem, using 
different parameter settings, and run under different conditions on the Condor 
pool. For these trials, we use the problem SSN, which arises from a network 
design application described by Sen, Doverspike, and Cosares [p3| . This problem 
is based on a graph with 89 arcs, each representing a telecommunications link 
between two cities. The first-stage variables represent the (nonnegative) extra 
capacity to be added to each of these 89 arcs to meet an uncertain demand 
pattern. There is a constraint on the total added capacity. The demands consist 
of requests for service between pairs of nodes in the graph. For each set of 
requests, a route through the network of sufficient capacity to meet the requests 
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must be found, otherwise a penalty term for each request that cannot be satisfied 
is added to the objective. The second-stage problems are network flow problems 
for calculating the routing for a given set of demand flows. Each such problem 
is nontrivial: 706 variables, 175 constraints, and 2284 nonzeros in the constraint 
matrix. The uncertainty lies in the fact that the demand for service on each 
of the 86 pairs is not known exactly. Rather, there are three to seven possible 
scenarios for these demands, all independent of each other, giving a total of 
about 10 70 possible scenarios. We use Monte Carlo sampling to obtain a sampled 
approximation with N — 10, 000 scenarios. The deterministic equivalent for this 
sampled approximation has approximately 1.75 x 10 6 constraints and 7.06 x 10 6 
variables. In all the runs, we used as starting point the computed solution for 
a different sampled approximation — one with 20, 000 scenarios and a different 
random seed. The starting point had a function value of approximately 9.868860, 
whereas the optimal objective was approximately 9.832544. 
In the tables below we list the following information. 

— points evaluated. The number of distinct values of the first-stage variables 
x generated by solving the master subproblem — the problem ( |l8| ) for Algo- 
rithm ALS, © for Algorithm TR, and (§§) for Algorithm ATR. 

— \B\. Maximum size of the basket, also denoted above by K. 

— number of tasks (chunks). Denoted above by C . 

— number of clusters. Denoted above by T, the number of partial sums ( p"lj ) 
into which the second-stage problems are divided. 

— max processors. The number of workers requested. 

— average processors. The average of the number of active (nonsuspended) 
worker processors available for use by our problem during the run. Because 
of the dynamic nature of the Condor system, the actual number of available 
processors fluctuates continually during the run. 

— parallel efficiency. The proportion of time for which worker processors were 
kept busy solving second-stage problems while they were owned by this run. 
maximum number of cuts in the model. The maximum number of 
(partial) subgradients that are used to define the model function during the 
course of the algorithm. 

— masterproblem solve time. The total time spent solving the master sub- 
problem to generate new candidate iterates during the course of the algo- 
rithm. 

— wall clock. The total time (in minutes) between submission of the job and 
termination. 

Table [j] shows the results of a series of trials of Algorithm ALS with three 
different values of a (.5, .7, and .85) and three different choices for the number 
of chunks G into which the second-stage solutions were divided (10, 25, and 50). 
The number of clusters T was fixed at 50, so that up to 50 cuts were generated 
at each iteration. For a = .5, the number of values of x for which second-stage 
evaluations are occurring at any point in time ranged from 2 to 4 during the 
runs, while for a = .85, there were never more than 2 points being evaluated 
simultaneously. 
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Table 1. SSN, with N = 10,000 scenarios, Algorithm ALS. 



When these runs were performed, we were not able to obtain anything ap- 
proaching the requested number 2C of workers from the Condor pool. As general 
trends, we see that the less synchronous variants (with a = .5 and a = .7) tend 
to be faster than the more synchronous variant (with a = .85), except for the 
final run, during which more processors were available. Moreover, larger values 
of C also tend to produce faster runs. We also note that the number of iter- 
ations does not depend strongly on a. We would not, of course, expect C to 
affect strongly the number of iterations, but since it affects the manner in which 
the second-stage evaluation work is distributed, we would expect it to affect the 
run time. Since the number of workers available to us during this run was lim- 
ited, however, we did not see the full benefit of a finer-grained work distribution 
(C = 50), though the relatively low parallel efficiency of the final run (a = .85, 
C = 50) indicates that the benefits of more processors may not have been great 
in any case. 

A note on typical task sizes: For C = 10, a typical task required about 50-280 
seconds on a typical worker machine available to us, while for C — 50, about 9-60 
seconds were required. The large variation reflects the wide range in processing 
ability of the machines available in a pool during a typical run. These numbers 
also generally hold for the results in Tables || and ||. 

By comparing the results from Tabic |l| with those reported in Tables ^] and ||, 
we verified that Algorithm ALS was not as efficient on this problem as Algorithm 
TR and certain variants of Algorithm ATR. One advantage, however, was that 
the asymptotic convergence of ALS was quite fast. Having taken many iterations 
to build up a model and return to a neighborhood of the solution after having 
strayed far from it in early iterations, the last three to four iterations home in 
rapidly from a relatively crude approximate solution (a relative accuracy (Q m in~ 
m(a;' c+1 ))/(l + |Q m i n |) of between .0006 and .0026) to a solution of high accuracy. 

We now turn to Tables || and ||, which report on two sets of trials on the 
same problem as in Table El In these trials we varied the following parameters: 
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Table 2. SSN, with N = 10, 000 scenarios, first trial, Algorithms TR and ATR 



— basket size: K — 1 (synchronous TR) as well as K = 3, 6, 9, 14; 

— number of tasks: C — 10, 25, 50, as in Table [j]; 

— number of clusters: T = 50, 100. 

The parameter a was fixed at .7 in all these runs. 

The results in Table |^ were obtained with the master processor running on 
an Intel Solaris machine, while Table || was obtained with a Linux master. In 
both cases, the Condor pool that we tapped for worker processors was identical. 
Therefore, it is possible to do a meaningful comparison between each line of 
Table || and its counterpart in Table ||. Conditions on the Condor pool varied 
between and during each trial. This fact, combined with the properties of the 
algorithm, resulted in large variability of runtime from one trial to the next, as 
we discuss below. 
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Table 3. SSN, with N = 10, 000 scenarios, second trial, Algorithms TR and ATR. 



The nondeterministic nature of the algorithms is evident in doing a side-by- 
sidc comparison of the two tables. Even for synchronous TR, the slightly different 
numerical values for function and subgradient value returned by different workers 
in different runs results in slight variations in the iteration sequence and therefore 
slight differences in the number of iterations. For the asynchronous Algorithm 
ATR, the nondeterminism is even more marked. During the basket-filling phase 
of the algorithm, computation of a new x is triggered when a certain proportion 
of tasks from a current value of x has been returned. On different runs, the tasks 
will be returned in different orders, so the information used by the trust-region 
subproblem (59) in generating the new point will vary from run to run, and the 
resulting iteration sequences will generally show substantial differences. 

The synchronous TR algorithm is clearly better than the ATR variants with 
K > f in terms of total computation, which is roughly proportional to the 



38 



Jeff Linderoth, Stephen Wright 



number of iterations. In fact, the total amount of work increases steadily with 
basket size. Because of the decreased synchronicity requirements and the greater 
parallelism obtained for K > 1, the wall clock times (last columns) do not 
follow quite the same trend. The wall clock times for basket sizes K = 3 and 
K = 6 are at least competitive with the results obtained for the synchronous TR 
algorithm. The choice K — 6 gave few of the fastest runs but did yield consistent 
performance over all the different choices for the other parameters, and under 
different Condor pool conditions. 

The deleterious effects of synchronicity in Algorithm TR can be seen in its 
poor performance on several instances, particularly during the second trial. Let 
us compare, for instance, the entries in the two tables for the variant of TR with 
C = 50 and T = 100. In the first trial, this run used 42 worker processors on 
average and took 35 minutes, while in the second trial it used 30 workers on 
average and required 122 minutes. The difference in runtime is too large to be 
accounted for by the number of workers. Because this is a synchronous algorithm, 
the time required for each iteration is determined by the time required for the 
slowest worker to return the results of its task. In the first trial, almost all tasks 
required between 6 and 35 seconds, except for a few iterations that contained 
tasks that took up to 62 seconds. In the second trial, the slowest worker at each 
iteration almost always required more than 60 seconds to complete its task. We 
return to this point in discussing Table |] below. 

Other general observations we can make are that 100 clusters give almost 
uniformly better results in terms of wall clock time than 50 clusters, although 
the higher number results in a larger number of cuts in the trust-region sub- 
problems and an increased amount of time on the master processor in solving 
these problems. The latter factor is critical for K = 9 and K — 14, which do 
not compare favorably with the smaller values of K on this problem, even if 
many more worker processors are available. For the large basket sizes, the loss of 
control induced by the increase in assynchronicity leads to a significantly larger 
number of points that are evaluated. 

In all cases, it takes some time for the model m to become a good enough ap- 
proximation to Q that it generates a step that meets the trust-region acceptance 
criteria. The six TR runs in Table ||, for instance, required 18, 27, 16, 22, 16, and 
26 trust-region subproblems to be solved, respectively, before they stepped away 
from the initial point. (Note that, as expected, the runs with T — 100 required 
fewer such iterations than those with T = 50.) After the first step is taken, most 
steps are successful; that is, the first minor iterate usually is accepted as the next 
major iterate. Occasionally, two to four minor iterations are required before the 
next major iteration is identified. Similar behavior is observed for the runs of 
ATR, except that successful iterations are more widely spaced. For the first run 
with K = 6 in Table ||, for instance, the 37th solution of ( |59| ) yields the first 
successful step; then 36 of the following 99 solutions of the subproblem yield 
successful steps. 

In Table |^, we took the most promising parameter combinations from Ta- 
bles |^ and ^ and ran three trials with each combination. The Condor pool condi- 
tions varied widely during this trial, as can be seen by the way that the average 
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Table 4. SSN final trial with best parameter combinations, N = 10, 000 scenarios, Algorithms 
TR and ATR. 



number of workers varies within each group of three runs. For the asynchronous 
ATR runs, the differences in wall clock times within each set of three runs usually 
can be explained in terms of the varying number of workers available. (A possible 
exception is the last line of the table, the third run of ATR with K — 6, C = 50 
and T — 100, which took almost four times as long as the first run while having 
only slightly fewer than half as many processors. While the speed of machines 
available was roughly similar between these runs, the third run was plagued with 
numerous suspensions as the workers were reclaimed by their owners. Total time 
that workers were suspended was over 23,000 seconds on the third run and less 
than 2,800 seconds during the first run.) On the other hand, the variability in 
wall clock time between the six runs of the synchronous TR algorithm was due 
not to the number of available workers but rather to the synchronicity effect 
described above. In the run reported in the first line of the table, for instance, 
the slowest worker on any iteration typically took less than 65 seconds. In the 
run reported on the third line, the time required by the slowest worker varied 
significantly but was typically much longer, 150 seconds and more. 

6.4- Larger Instances 

We also performed runs on several larger instances of SSN (with = 100, 000 
scenarios) and on some very large instances of the stormG2 problem, a cargo 
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flight scheduling application described by Mulvey and Ruszczyhski |l£ 
interest in this section is more in the sheer size of the problems that can be \) 
solved using the algorithms developed for the computational grid than with the 
relative performance of the algorithms with different parameter settings. 

Table || shows results for a sampled instance of SSN with N = 100,000 
scenarios, which is a linear program with approximately 1.75 x 10 7 constraints 
and 7.06 x 10 7 variables. This run was performed at a time when not many 
machines were available, and many suspensions occurred during the run. We 
chose T = 100 chunks per evaluation and found that most tasks required between 
41 and 300 seconds on the workers, with a few task times of more than 500 
seconds. (The benchmarks indicated that the worker speed varied over a factor 
of 7.) A total of 77 different workers were used during the run, though the average 
number of nonsuspended workers available at any time was only 39. In fact, at 
any given point in the computation there were an average of 7 workers assigned 
to this task that were suspended. Still, a result was obtained in about 22 hours. 

In the stormG2 problem of Mulvey and Ruszczyhski []l9| , the first-stage prob- 
lem contained 121 variables, while each second-stage problem contained 1259 
variables. We considered hrst a sampled approximation of this problem with 
250000 scenarios, which resulted in a linear program with 1.32 x 10 s constraints 
and 315 x 10 8 unknowns. Results are shown in Table M. The algorithm was 
started at a solution of a sampled instance with fewer scenarios and was quite 
close to optimal. The objective function at the initial point was approximately 
15499595.1, compared with an optimal value of 15499591.9 achieved by Algo- 
rithm TR. In fact, the TR algorithm takes only one major iteration — it accepts 
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the 16th minor iteration as the first major iterate a^i The ATR variant doe^, 
not take even one step — it terminates after determining that the initial point <^ 
x° is optimal to within the given convergence tolerance. Although we requested 
250 processors, an average of only 106 were available during the time that we 
performed these two test runs. The second run is able to utilize these to high 
efficiency, as the second-stage workload can be divided into a large number of 
chunks and very little time is spent in solving the trust-region subproblem. 

Finally, we report on a very large sampled instance of stormG2 with N = 10 
scenarios, an instance whose deterministic equivalent is a linear program with 
9.85 x 10 8 constraints and 1.26 x 10 10 variables. Performance is profiled in Table 0. 

We used the tighter convergence tolerance etoi = 10~ 6 for this run. The algo- 
rithm took successful steps at iterations 28, 34, 37, and 38, the last of these being 
the final iteration. The first evaluated point had a function value of 15526740, 
compared with a value of 15498842 at the final iteration. 

For this run, we augmented the Wisconsin Computer Science Condor pool 
with machines from Georgia Tech, the University of New Mexico, the Italian 
National Institute of Physics (INFN), the NCSA at the University of Illinois, and 
the IEOR Department at Columbia, the Albu, and the Wisconsin engineering 
Department. Table ^ shows the number and type of processors available at each 
of these locations. In contrast to the other runs reported here, we used the 
"MW-files" implementation of MW, the variant that uses shared files to perform 
communication between master and workers rather than Condor-PVM. 

The job ran for a total of almost 32 hours. The number of workers being used 
during the course of the run is shown in Figure |. The job was stopped after 
approximately 8 hours and was restarted manually from a checkpoint about 2 
hours later. It then ran for approximately 24 hours to completion. The number 
of workers dopped off significantly on two occasions. The drops were due to the 
master processor "blocking" to solve a difficult master problem and to checkpoint 
the state of the computation. During this time the worker processors were idle, 
and MW decided to release a number of the processors rather than have them 
sit idle. 

As noted in Table [t], an average of 433 workers were present at any given point 
in the run. The computation used a maximum of 556 workers, and there was a 
ratio of 12 in the speed of the slowest and fastest machines, as determined by the 
benchmarks. A total of 40837 tasks were generated during the run, representing 
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Table 8. Machines available for stormG2, with N = 10 7 scenarios. 
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Fig. 1. Number of workers used for stormG2, with N = 10 7 scenarios. 

3.99 x 10 8 second-stage linear programs. (At this rate, an average of 3472 second- 
stage linear programs were being solved per second during the run.) The average 
time to solve a task was 774 seconds. The total cumulative CPU time spent by 
the worker pool was 9014 hours, or just over one year of computation. 

7. Conclusions 

We have described L-shaped and trust-region algorithms for solving the two- 
stage stochastic linear programming problem with recourse, and derived asyn- 



Stochastic Programming on a Computational Grid 



43 



chronous variants suitable for parallel implementation on distributed heteroge- 
neous computational grids. We prove convergence results for the trust-region 
algorithms. Implementations based on the MW library and the Condor system 
are described, and we report on computational studies using different algorith- 
mic parameters under different pool conditions. Becasue of the dynamic nature 
of the computational pool, it is impossible to arrive at a "best" configuration 
or set of algorithmic parameters for all instances. Instead, it may be important 
to adjust the algorithm parameters dynamically; we suggest this as a line of fu- 
ture research. Finally, we report on the solution of some large sampled instances 
of problems from the literature, including an instance of the stormG2 problem 
whose deterministic equivalent has more than 10 10 unknowns. Since the use of 
the computational grid has the greatest benefit on problems that require large 
amounts of computation, the algorithms developed here are best suited to larger 
(multistage) problems or incorporated into a sample average approximation ap- 
proach (see Shapiro and Homem-de-Mello 
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